Chapter 1: Tools and methods for open and reproducible research

About first week assignment

This is the first assignment of the Introduction to Open Data Science 2018 course. This week included various Git / RStudio / R Markdown exercises.

My GitHub repository for all exercises and source code


Chapter 2: Regression and model validation

library(data.table)

# Read in learning2014 created with create_learning2014.R
learning2014<-fread("learning2014.txt")

Dimensions and variable characteristics of learning2014

# Number of records and number of variables in the dataset
dim(learning2014)
## [1] 166   7
# Variable types (int = integer variables, num = continuous floating-point variables)
str(learning2014)
## Classes 'data.table' and 'data.frame':   166 obs. of  7 variables:
##  $ Gender  : int  2 1 2 1 1 2 1 2 1 2 ...
##  $ Age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ Attitude: int  37 31 25 35 37 38 35 29 38 21 ...
##  $ Deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ Stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ Surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ Points  : int  25 12 24 10 22 21 21 31 24 26 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Gender: Male = 1, Female = 2
Age: Age (in years) derived from the date of birth
Attitude: Global attitude toward statistics
Deep: Deep approach
Stra: Strategic approach
Surf: Surface approach
Points: Total points (max of all)

Variable summary statistics

summary(learning2014)
##      Gender           Age           Attitude          Deep      
##  Min.   :1.000   Min.   :17.00   Min.   :14.00   Min.   :1.583  
##  1st Qu.:1.000   1st Qu.:21.00   1st Qu.:26.00   1st Qu.:3.333  
##  Median :2.000   Median :22.00   Median :32.00   Median :3.667  
##  Mean   :1.663   Mean   :25.51   Mean   :31.43   Mean   :3.680  
##  3rd Qu.:2.000   3rd Qu.:27.00   3rd Qu.:37.00   3rd Qu.:4.083  
##  Max.   :2.000   Max.   :55.00   Max.   :50.00   Max.   :4.917  
##       Stra            Surf           Points     
##  Min.   :1.250   Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.625   1st Qu.:2.417   1st Qu.:19.00  
##  Median :3.188   Median :2.833   Median :23.00  
##  Mean   :3.121   Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.625   3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :5.000   Max.   :4.333   Max.   :33.00

Histograms of variables to visualize distribution shapes

par(mfrow=c(3,3))
sapply(1:ncol(learning2014),function(i) hist(learning2014[[i]], breaks=20, xlab=NA,main=names(learning2014)[i]))

Density plots

par(mfrow=c(3,3))
sapply(1:ncol(learning2014),function(i) plot(density(learning2014[[i]]), xlab=NA,main=names(learning2014)[i]))

Age has a long right tail. Attitude, Deep, Stra, and Surf could be considered approximately normally distributed. From the Surf and Stra histograms we can clearly see that they are sum variables with multiple modes (peaks in the histogram).

Pairwise relationships between variables

pairs(learning2014)

Visually, at least Attitude seems to have a relationship with Points. To investigate this further, let’s additionally draw a correlation plot.

library(corrplot)
## corrplot 0.84 loaded
corrplot(cor(learning2014, method="spearman"), type="lower")

Attitude clearly correlates with Points. Furthermore, Age, Gender, Stra, and Surf all seem to have small correlations with Points. Given that we may choose only three variables for this exercise, Age and Gender together with Attitude seem like the most interesting combination.

Linear regression model

Let’s first try a model with Age and Gender (basic demographic covariates), and Attitude as the third variable.

#library(rms)
summary(lm(Points~Age+I(as.factor(Gender))*Attitude, data=learning2014))
## 
## Call:
## lm(formula = Points ~ Age + I(as.factor(Gender)) * Attitude, 
##     data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.3457  -3.3677   0.1811   3.8529  10.2281 
## 
## Coefficients:
##                                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)                    11.21591    4.18555   2.680 0.008135 ** 
## Age                            -0.07516    0.05379  -1.397 0.164246    
## I(as.factor(Gender))2           2.84438    4.45715   0.638 0.524274    
## Attitude                        0.41480    0.11115   3.732 0.000263 ***
## I(as.factor(Gender))2:Attitude -0.07583    0.13155  -0.576 0.565115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.326 on 161 degrees of freedom
## Multiple R-squared:  0.2034, Adjusted R-squared:  0.1836 
## F-statistic: 10.28 on 4 and 161 DF,  p-value: 1.944e-07

Apparently age and gender have no statistically significant effect in this model, so let’s keep only Attitude.

summary(lm(Points~Attitude, data=learning2014))
## 
## Call:
## lm(formula = Points ~ Attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.63715    1.83035   6.358 1.95e-09 ***
## Attitude     0.35255    0.05674   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

As Attitude is measured on an arbitrary scale, let’s standardize it (mean = 0, sd = 1) for easier interpretation.

summary(lm(Points~I(scale(Attitude)), data=learning2014))
## 
## Call:
## lm(formula = Points ~ I(scale(Attitude)), data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -16.9763  -3.2119   0.4339   4.1534  10.6645 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)         22.7169     0.4129  55.019  < 2e-16 ***
## I(scale(Attitude))   2.5733     0.4141   6.214 4.12e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.32 on 164 degrees of freedom
## Multiple R-squared:  0.1906, Adjusted R-squared:  0.1856 
## F-statistic: 38.61 on 1 and 164 DF,  p-value: 4.119e-09

Now we can say that Attitude is highly significant in this model (p = 4e-9). Its effect (beta) is 2.5733, which means that a one standard deviation increase in Attitude is associated with an increase of approximately 2.57 in Points.

The multiple R-squared is 0.1906, which means that Attitude alone explains about 19% of the variability in Points.
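The standardized coefficient relates to the raw one in a simple way: it is the raw slope multiplied by the standard deviation of the predictor. A minimal sketch, assuming learning2014 is loaded as above:

```r
# Sketch: the slope on the standardized scale equals the raw slope times sd(Attitude)
fit_raw    <- lm(Points ~ Attitude, data = learning2014)
fit_scaled <- lm(Points ~ I(scale(Attitude)), data = learning2014)
coef(fit_raw)["Attitude"] * sd(learning2014$Attitude)  # equals the 2.5733 above
```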

Model diagnostics

plot(lm(Points~Attitude, data=learning2014))

The residuals do not show any non-linear patterns. The Q-Q plot indicates that the residuals are approximately normally distributed, which further supports the linear model assumptions. The scale-location plot shows no apparent heteroscedasticity, although cases 145, 56, and 36 could be investigated further as potential outliers. However, none of these cases (nor any others) falls beyond Cook’s distance in the fourth plot; if they did, they would probably need to be excluded from the analysis as outliers that influence the model too much.

In summary, we can conclude that the assumptions of the linear model appear to be well met.


Chapter 3: Logistic regression

library(data.table)

# Read in the joined student alcohol consumption dataset
alc<-fread("data/alc.txt")
colnames(alc)
##  [1] "school"     "sex"        "age"        "address"    "famsize"   
##  [6] "Pstatus"    "Medu"       "Fedu"       "Mjob"       "Fjob"      
## [11] "reason"     "nursery"    "internet"   "guardian"   "traveltime"
## [16] "studytime"  "failures"   "schoolsup"  "famsup"     "paid"      
## [21] "activities" "higher"     "romantic"   "famrel"     "freetime"  
## [26] "goout"      "Dalc"       "Walc"       "health"     "absences"  
## [31] "G1"         "G2"         "G3"         "alc_use"    "high_use"

This “Student Performance Data Set” comes from the UCI Machine Learning Repository. It is described as:

“This data approach student achievement in secondary education of two Portuguese schools. The data attributes include student grades, demographic, social and school related features) and it was collected by using school reports and questionnaires. Two datasets are provided regarding the performance in two distinct subjects: Mathematics (mat) and Portuguese language (por). In [Cortez and Silva, 2008], the two datasets were modeled under binary/five-level classification and regression tasks. Important note: the target attribute G3 has a strong correlation with attributes G2 and G1. This occurs because G3 is the final year grade (issued at the 3rd period), while G1 and G2 correspond to the 1st and 2nd period grades. It is more difficult to predict G3 without G2 and G1, but such prediction is much more useful (see paper source for more details).”

We joined these two datasets (mat and por); overlapping variables were either averaged (quantitative) or taken from the math dataset (qualitative). alc_use and high_use were calculated from the data.
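The join logic described above can be sketched as a small helper. The identifier column list and the object names mat and por are assumptions for illustration (the actual wrangling script is not shown here):

```r
# Hypothetical identifier columns assumed shared by both questionnaires
join_by <- c("school", "sex", "age", "address", "famsize", "Pstatus",
             "Medu", "Fedu", "Mjob", "Fjob", "reason", "nursery", "internet")

# Join the two questionnaires: average overlapping numeric answers,
# keep the math answer for overlapping qualitative ones
join_students <- function(mat, por, by = join_by) {
  alc <- merge(mat, por, by = by, suffixes = c(".mat", ".por"))
  for (v in setdiff(intersect(names(mat), names(por)), by)) {
    m <- alc[[paste0(v, ".mat")]]
    p <- alc[[paste0(v, ".por")]]
    alc[[v]] <- if (is.numeric(m)) (m + p) / 2 else m
  }
  alc$alc_use  <- (alc$Dalc + alc$Walc) / 2  # mean of weekday and weekend use
  alc$high_use <- alc$alc_use > 2
  alc
}
```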

Analysis

Let’s select four variables for the analysis:

  • Age (age is a usual suspect as a covariate)
  • Sex (gender likewise; alcohol consumption is probably more common among men)
  • Mother’s education (mother’s education is known to explain a child’s education level)
  • Going out with friends (naturally affects alcohol consumption)
vars<-c("age","sex","Medu","goout")
str(alc[,vars,with=F])
## Classes 'data.table' and 'data.frame':   382 obs. of  4 variables:
##  $ age  : int  18 17 15 15 16 16 16 17 15 15 ...
##  $ sex  : chr  "F" "F" "F" "F" ...
##  $ Medu : int  4 1 1 4 3 4 2 4 3 3 ...
##  $ goout: int  4 3 2 2 2 2 4 4 2 1 ...
##  - attr(*, ".internal.selfref")=<externalptr>

Age is a continuous variable, gender is dichotomous, mother’s education is multinomial, and going out is an ordinal (Likert) variable.

library(ggplot2)
library(GGally)
#ggpairs(alc, columns=c(vars,"high_use"))

1. Age

library(ggridges)
library(magrittr)
library(dplyr)
ggplot(data=alc, aes(x=high_use, y=age)) + geom_boxplot()

# Differences in mean and standard deviation
alc %>% group_by(high_use) %>% summarise(mean_age=mean(age), sd_age=sd(age))
## # A tibble: 2 x 3
##   high_use mean_age sd_age
##   <lgl>       <dbl>  <dbl>
## 1 FALSE        16.5   1.15
## 2 TRUE         16.8   1.21
# Univariate non-parametric test
wilcox.test(age~high_use, data=alc)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  age by high_use
## W = 13266, p-value = 0.05185
## alternative hypothesis: true location shift is not equal to 0

The distribution seems somewhat skewed, hence wilcox.test (the Mann-Whitney test). It indicates an association, although not a very strong one: older students are likely to have higher consumption, which makes intuitive sense.

2. Gender

ggplot(data=alc, aes(x=sex, fill=high_use)) + geom_bar()

table(alc$high_use, alc$sex)
##        
##           F   M
##   FALSE 157 113
##   TRUE   41  71
chisq.test(x=alc$high_use, y=alc$sex)
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  alc$high_use and alc$sex
## X-squared = 13.863, df = 1, p-value = 0.0001967

Gender seems to have a strong univariate association, as anticipated.

3. Mother’s education

ggplot(data=alc, aes(x=Medu, fill=high_use)) + geom_bar()

table(alc$high_use, alc$Medu)
##        
##          0  1  2  3  4
##   FALSE  1 33 80 59 97
##   TRUE   2 18 18 36 38
chisq.test(y=alc$high_use, x=as.factor(alc$Medu))
## Warning in chisq.test(y = alc$high_use, x = as.factor(alc$Medu)): Chi-
## squared approximation may be incorrect
## 
##  Pearson's Chi-squared test
## 
## data:  as.factor(alc$Medu) and alc$high_use
## X-squared = 12.031, df = 4, p-value = 0.01713

Mother’s education seems to have a slight association, which we expected a priori. (The chi-squared warning stems from the small counts in the Medu = 0 category.)

4. Going out with friends (1-5)

ggplot(data=alc, aes(x=high_use, y=goout)) + geom_boxplot()

# Differences in mean and standard deviation
alc %>% group_by(high_use) %>% summarise(mean_age=mean(goout), sd_age=sd(goout))
## # A tibble: 2 x 3
##   high_use mean_age sd_age
##   <lgl>       <dbl>  <dbl>
## 1 FALSE        2.86   1.04
## 2 TRUE         3.73   1.10
# Univariate non-parametric test
wilcox.test(goout~high_use, data=alc)
## 
##  Wilcoxon rank sum test with continuity correction
## 
## data:  goout by high_use
## W = 8577, p-value = 5.88e-12
## alternative hypothesis: true location shift is not equal to 0

“Going out” indicates a strong univariate association, which also makes intuitive sense.

Logistic regression

# Let's use Harrell's excellent Regression Modeling Strategies (rms) package
library(rms)
# Make sure Medu is a factor and rescale the Likert-scale "going out" variable (centered at zero: very high = 1, very low = -1)
alc[,Medu := as.factor(Medu)]
alc[,goout := (goout - 3)/2]

lrm(high_use~age+sex+Medu+goout, data=alc)
## Logistic Regression Model
##  
##  lrm(formula = high_use ~ age + sex + Medu + goout, data = alc)
##  
##                        Model Likelihood     Discrimination    Rank Discrim.    
##                           Ratio Test           Indexes           Indexes       
##  Obs           382    LR chi2      78.20    R2       0.264    C       0.760    
##   FALSE        270    d.f.             7    g        1.301    Dxy     0.519    
##   TRUE         112    Pr(> chi2) <0.0001    gr       3.672    gamma   0.523    
##  max |deriv| 6e-12                          gp       0.226    tau-a   0.216    
##                                             Brier    0.163                     
##  
##            Coef    S.E.   Wald Z Pr(>|Z|)
##  Intercept -0.2993 2.2897 -0.13  0.8960  
##  age        0.0703 0.1094  0.64  0.5204  
##  sex=M      0.9357 0.2551  3.67  0.0002  
##  Medu=1    -2.0598 1.2909 -1.60  0.1106  
##  Medu=2    -3.1142 1.2888 -2.42  0.0157  
##  Medu=3    -2.0764 1.2741 -1.63  0.1032  
##  Medu=4    -2.6082 1.2745 -2.05  0.0407  
##  goout      1.5668 0.2468  6.35  <0.0001 
## 
confint.default(lrm(high_use~age+sex+Medu+goout, data=alc))
##                2.5 %     97.5 %
## Intercept -4.7870567  4.1883960
## age       -0.1441360  0.2847951
## sex=M      0.4356889  1.4357720
## Medu=1    -4.5898916  0.4702347
## Medu=2    -5.6401508 -0.5881813
## Medu=3    -4.5736251  0.4209207
## Medu=4    -5.1061906 -0.1101853
## goout      1.0831240  2.0504201
ggcoef(glm(high_use~age+sex+Medu+goout, data=alc, family="binomial"), exponentiate = F, errorbar_height=.2, color="blue", sort="ascending")

The results seem quite interesting (below: p-value, OR, and 95% CI, respectively):

  • Age seems to associate univariately, but combined with the other variables in this model the association disappears (p=0.52, OR=1.07 [0.87-1.33]). Other, stronger variables probably explain its effect away (e.g. going out is more common among older students).
  • Gender associates strongly (for men: p=0.0002, OR=2.55 [1.55-4.20] compared to females).
  • Mother’s education also explains high consumption. Interestingly, having a mother with only primary-level education completed (Medu=2) seems to be more strongly protective than having a mother with higher-level education. (For Medu=2: p=0.016, OR=0.04 [0.004-0.56], contrasted to mothers without any education.)
  • Going out with friends is strongly associated with higher alcohol use (for very high: p<0.0001, OR=4.79 [2.95-7.77], in contrast to a medium level of going out).
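The ORs and CIs quoted above can be derived by exponentiating the logistic regression coefficients and their Wald intervals; a sketch, assuming alc is loaded and transformed as above:

```r
# Sketch: odds ratios with 95% CIs from the same model fitted with glm
fit <- glm(high_use ~ age + sex + Medu + goout, data = alc, family = "binomial")
round(exp(cbind(OR = coef(fit), confint.default(fit))), 2)
# e.g. exp(0.9357) ~ 2.55 for sex=M, matching the OR quoted above
```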

Predictive performance

# Remove age from the model (its p-value was high)
mod2<-lrm(high_use~sex+Medu+goout, data=alc, x=T, y=T)

# Get predictions and threshold the linear predictor at zero (i.e. probability 0.5)
preds<-as.numeric(predict(mod2)>0)

# Confusion matrix
cmat<-table(preds, as.numeric(alc$high_use))

cmat
##      
## preds   0   1
##     0 254  68
##     1  16  44
# Training error rate (proportion of incorrectly classified individuals)
((cmat[1,2]+cmat[2,1])/sum(cmat))
## [1] 0.2198953

The training error rate (22%) is better than random guessing (50%). However, this is likely rather optimistic, as it is the in-sample error rate.
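As a sanity check, a fairer baseline than coin-flipping is always predicting the majority class. A self-contained sketch using the class counts printed in the lrm output above (270 FALSE, 112 TRUE):

```r
# Majority-class baseline from the observed class counts
prevalence <- 112 / (270 + 112)                # share of high users, ~0.293
baseline   <- min(prevalence, 1 - prevalence)  # error of always predicting FALSE
baseline  # ~0.29, so the model's 22% beats both this and coin-flipping (50%)
```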

Cross-validating

We can do cross-validation natively with the rms package.

validate(mod2, method="crossvalidation", B=10)
##           index.orig training    test optimism index.corrected  n
## Dxy           0.5188   0.5200  0.4691   0.0509          0.4679 10
## R2            0.2625   0.2653  0.2021   0.0632          0.1993 10
## Intercept     0.0000   0.0000 -0.1931   0.1931         -0.1931 10
## Slope         1.0000   1.0000  0.8200   0.1800          0.8200 10
## Emax          0.0000   0.0000  0.0805   0.0805          0.0805 10
## D             0.2010   0.2031  0.1304   0.0728          0.1282 10
## U            -0.0052  -0.0058  0.0239  -0.0297          0.0245 10
## Q             0.2062   0.2089  0.1065   0.1024          0.1038 10
## B             0.1634   0.1629  0.1731  -0.0103          0.1736 10
## g             1.2971   1.3098  1.0661   0.2437          1.0535 10
## gp            0.2252   0.2258  0.1876   0.0382          0.1870 10

This shows the optimism in the original, non-cross-validated model (e.g. in Somers’ D).

We can also include all variables in the model and perform backwards stepwise elimination with the rms package:

# Add all variables to the model (excluding those that were used to create the outcome variable)
mod3<-lrm(high_use~., data=alc[,-c("alc_use","Dalc","Walc"),with=F],x=T, y=T)

validate(mod3, method="crossvalidation", B=10, bw=T, pr=F)
## 
##      Backwards Step-down - Original Model
## 
##  Deleted    Chi-Sq d.f. P      Residual d.f. P      AIC   
##  Mjob        4.57  4    0.3339  4.57     4   0.3339  -3.43
##  Fjob        4.59  4    0.3316  9.17     8   0.3284  -6.83
##  internet    0.00  1    0.9707  9.17     9   0.4219  -8.83
##  schoolsup   0.01  1    0.9086  9.18    10   0.5150 -10.82
##  G3          0.03  1    0.8710  9.21    11   0.6027 -12.79
##  higher      0.05  1    0.8278  9.26    12   0.6810 -14.74
##  Pstatus     0.05  1    0.8156  9.31    13   0.7492 -16.69
##  age         0.08  1    0.7816  9.39    14   0.8055 -18.61
##  failures    0.18  1    0.6746  9.56    15   0.8463 -20.44
##  Fedu        0.21  1    0.6490  9.77    16   0.8784 -22.23
##  famsup      0.37  1    0.5408 10.14    17   0.8975 -23.86
##  reason      4.56  3    0.2071 14.70    20   0.7931 -25.30
##  school      0.31  1    0.5795 15.01    21   0.8224 -26.99
##  romantic    0.46  1    0.4994 15.47    22   0.8415 -28.53
##  nursery     0.50  1    0.4781 15.97    23   0.8566 -30.03
##  freetime    0.74  1    0.3884 16.71    24   0.8606 -31.29
##  famsize     0.99  1    0.3186 17.71    25   0.8545 -32.29
##  guardian    2.99  2    0.2247 20.69    27   0.8004 -33.31
##  health      1.66  1    0.1975 22.35    28   0.7646 -33.65
##  traveltime  2.10  1    0.1469 24.46    29   0.7060 -33.54
##  Medu        8.39  4    0.0783 32.85    33   0.4746 -33.15
##  paid        1.82  1    0.1768 34.67    34   0.4357 -33.33
##  studytime   2.71  1    0.0995 37.39    35   0.3600 -32.61
##  G2          3.14  1    0.0762 40.53    36   0.2772 -31.47
##  G1          0.68  1    0.4103 41.21    37   0.2916 -32.79
##  address     3.76  1    0.0524 44.97    38   0.2030 -31.03
##  activities  3.88  1    0.0490 48.85    39   0.1341 -29.15
##  famrel      5.79  1    0.0161 54.64    40   0.0613 -25.36
##  sex         6.38  1    0.0115 61.02    41   0.0228 -20.98
##  absences    6.59  1    0.0102 67.61    42   0.0074 -16.39
##  goout      15.15  1    0.0001 82.76    43   0.0003  -3.24
## 
## Approximate Estimates after Deleting Factors
## 
##         Coef   S.E. Wald Z        P
## [1,] -0.6665 0.1432 -4.653 3.27e-06
## 
## Factors in Final Model
## 
## None
##           index.orig training    test optimism index.corrected  n
## Dxy           0.0000   0.0000  0.0000   0.0000          0.0000 10
## R2            0.0000   0.0000  0.0000   0.0000          0.0000 10
## Intercept     0.0000   0.0000 -0.9049   0.9049         -0.9049 10
## Slope         1.0000   1.0000  1.0000   0.0000          1.0000 10
## Emax          0.0000   0.0000  0.2224   0.2224          0.2224 10
## D            -0.0026  -0.0029 -0.0263   0.0234         -0.0260 10
## U            -0.0052  -0.0058 -0.0263   0.0205         -0.0257 10
## Q             0.0026   0.0029  0.0000   0.0029         -0.0003 10
## B             0.2072   0.2072  0.2078  -0.0006          0.2078 10
## g             0.0000   0.0000  0.0000   0.0000          0.0000 10
## gp            0.0000   0.0000  0.0000   0.0000          0.0000 10
## 
## Factors Retained in Backwards Elimination
## 
##  school sex age address famsize Pstatus Medu Fedu Mjob Fjob reason nursery
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##  internet guardian traveltime studytime failures schoolsup famsup paid
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##                                                                       
##  activities higher romantic famrel freetime goout health absences G1 G2 G3
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
##                                                                           
## 
## Frequencies of Numbers of Factors Retained
## 
##  0 
## 10

Chapter 4: Clustering and classification

Housing values in Suburbs of Boston

set.seed(12345)
library(MASS)
library(dplyr)
data(Boston)
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...

Variable details:

  • crim: per capita crime rate by town.
  • zn: proportion of residential land zoned for lots over 25,000 sq.ft.
  • indus: proportion of non-retail business acres per town.
  • chas: Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  • nox: nitrogen oxides concentration (parts per 10 million).
  • rm: average number of rooms per dwelling.
  • age: proportion of owner-occupied units built prior to 1940.
  • dis: weighted mean of distances to five Boston employment centres.
  • rad: index of accessibility to radial highways.
  • tax: full-value property-tax rate per $10,000.
  • ptratio: pupil-teacher ratio by town.
  • black: 1000(Bk - 0.63)^2 where Bk is the proportion of blacks by town.
  • lstat: lower status of the population (percent).
  • medv: median value of owner-occupied homes in $1000s.
library(GGally)
library(corrplot)
corrplot(cor(Boston, method="spearman"), method = "circle")

newBost<-Boston
newBost$chas<-as.factor(newBost$chas)
newBost$rad<-as.factor(newBost$rad)
ggpairs(newBost)

Many variables seem to have non-normal distributions; many even seem heavy-tailed, and a few bimodal (having two modes, i.e. peaks, such as indus, tax, and rad). Strong positive and negative correlations are also present among several variables.

Let’s plot the density of each variable separately:

par(mfrow=c(4,4), oma = c(5,4,0,0) + 0.1,
          mar = c(0,0,1,1) + 0.1)
tmp<-lapply(1:ncol(Boston), function(i) plot(density(as.numeric(Boston[,i])), main=colnames(Boston)[i]))

Next, we normalize (scale) these variables to have zero mean and a standard deviation of one. This can be somewhat questionable for e.g. heavy-tailed distributions, where the variance and standard deviation may not be well defined, but it is requested by the assignment instructions.
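What scale() does can be written out by hand; a minimal self-contained sketch with toy data:

```r
# Standardizing by hand: (x - mean(x)) / sd(x) is exactly what scale() computes
x <- c(1, 4, 4, 7, 9)              # toy data
z <- (x - mean(x)) / sd(x)
all.equal(as.numeric(scale(x)), z)  # TRUE; mean(z) is 0 and sd(z) is 1
```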

Let’s scale and plot the new variables (the mean has been moved to zero and the standard deviation scaled to one):

# Scale original variables (all columns of Boston are numeric)
newBost<-as.data.frame(scale(Boston))

# Plot scaled variable
par(mfrow=c(4,4), oma = c(5,4,0,0) + 0.1,
          mar = c(0,0,1,1) + 0.1)
tmp<-lapply(1:ncol(newBost), function(i) plot(density(as.numeric(newBost[,i])), main=colnames(newBost)[i]))

newBost<-as.data.frame(newBost)

Create a factor from the quantiles of crim (replacing the original variable). Let’s also do an 80:20 split of the dataset by creating a holdout indicator.

newBost$crim<-with(newBost, cut(crim, quantile(crim, c(0, 0.25, 0.5, 0.75, 1)), include.lowest=TRUE, labels = c("Q1", "Q2", "Q3", "Q4")))

holdout<-rbinom(n = nrow(newBost), size = 1, prob = 0.2)
print(table(holdout))
## holdout
##   0   1 
## 393 113
trainset<-newBost[holdout==0,]
testset<-newBost[holdout==1,]

LDA

Estimate the model using trainset, predict in testset (with the original crim removed), and compare to the original crim in testset. Crosstabulate predictions against the original values.

library(MASS)
mod1<-lda(crim~., data=trainset)

test_crim<-testset$crim
testset$crim<-predict(mod1, newdata=testset)$class

table(test_crim, testset$crim)
##          
## test_crim Q1 Q2 Q3 Q4
##        Q1 13 17  3  0
##        Q2  3 13 10  0
##        Q3  0  5 25  0
##        Q4  0  0  0 24
cat(paste("Test error rate:", round(mean(test_crim != testset$crim)*100, 1), "%"))
## Test error rate: 33.6 %
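The overall error rate hides class-level differences; a small self-contained sketch computing per-class accuracy from the confusion matrix printed above:

```r
# Confusion matrix from above (rows = true class, columns = predicted class)
cmat <- matrix(c(13, 17,  3,  0,
                  3, 13, 10,  0,
                  0,  5, 25,  0,
                  0,  0,  0, 24),
               nrow = 4, byrow = TRUE,
               dimnames = list(true = paste0("Q", 1:4), pred = paste0("Q", 1:4)))
round(diag(cmat) / rowSums(cmat), 2)  # 0.39 0.50 0.83 1.00
# Q4 is classified perfectly, while Q1 and Q2 are often confused with their neighbors
```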

K-means clustering

Reload Boston and rescale. Use the total within-cluster sum of squares to find the right number of clusters:

# Reload Boston and rescale (all columns are numeric)
newBost<-as.data.frame(scale(Boston))

wss <- sapply(1:20, 
              function(k){kmeans(newBost, k, nstart=50, iter.max = 15)$tot.withinss})
wss
##  [1] 7070.000 4576.681 3871.374 3445.946 3074.612 2723.217 2436.911
##  [8] 2248.218 2064.030 1940.011 1849.925 1729.998 1659.495 1598.493
## [15] 1544.558 1476.380 1426.614 1385.056 1333.214 1286.531
plot(1:20, wss,
     type="b", pch = 19, frame = FALSE, 
     xlab="Number of clusters K",
     ylab="Total within-clusters sum of squares")

This approach doesn’t seem to give a clear elbow. Let’s try the gap statistic:

library(cluster)
cg<-clusGap(newBost, kmeans, 10, B = 100)
plot(cg)

This approach suggests nine clusters (although a significantly higher number could also be argued for). Let’s build a k-means clustering model with nine clusters and plot it:
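The suggested number of clusters can also be read off the gap statistic programmatically; a sketch assuming cg is the clusGap result from above (maxSE from the cluster package implements the standard selection rules):

```r
library(cluster)
# Apply the "firstSEmax" rule to the gap values and their simulation SEs
maxSE(cg$Tab[, "gap"], cg$Tab[, "SE.sim"], method = "firstSEmax")
```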

km1<-kmeans(newBost, 9)
ggpairs(as.data.frame(newBost), aes(col=as.factor(km1$cluster)))


Chapter 5: Dimensionality reduction techniques

Exploring dataset

Read in the data and visualize with ggpairs and spearman correlations.

library(data.table)
library(GGally)
library(corrplot)
library(dplyr)
human<-read.table("http://s3.amazonaws.com/assets.datacamp.com/production/course_2218/datasets/human2.txt", row.names=1, header=T, sep=",")
summary(human)
##     Edu2.FM          Labo.FM          Edu.Exp         Life.Exp    
##  Min.   :0.1717   Min.   :0.1857   Min.   : 5.40   Min.   :49.00  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:11.25   1st Qu.:66.30  
##  Median :0.9375   Median :0.7535   Median :13.50   Median :74.20  
##  Mean   :0.8529   Mean   :0.7074   Mean   :13.18   Mean   :71.65  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:15.20   3rd Qu.:77.25  
##  Max.   :1.4967   Max.   :1.0380   Max.   :20.20   Max.   :83.50  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
ggpairs(human)

corrplot(cor(human, method="spearman"))

All variables seem to be unimodal and most have skewed, non-normal distributions, some with heavy tails. Potentially only Edu.Exp could be properly approximated with a normal distribution. Correlations among Edu2.FM, Edu.Exp, Life.Exp, GNI, Mat.Mor, and Ado.Birth seem high (in both positive and negative directions). Labo.FM and Parli.F do not seem to correlate strongly with the other variables; they are also conceptually different, describing society-level differences, whereas the other variables are conceptually individual-level.

PCA

Next, we were asked to perform a PCA and draw a biplot, first with raw and then with scaled data. (Scaling non-normal distributions using the standard deviation and mean can be considered somewhat questionable, but let’s follow the assignment.)

pc<-princomp(human)
biplot(pc, cex=c(0.3, 1), main="Non-scaled PCA")

pc<-princomp(scale(human))
biplot(pc, cex=c(0.3, 1), main="Scaled PCA", xlab="PC1, female opportunity dimension (individual)", ylab="PC2, general female participation dimension (society level)")

barplot(pc$sdev)

summary(pc)
## Importance of components:
##                           Comp.1    Comp.2     Comp.3     Comp.4
## Standard deviation     2.0641471 1.1360379 0.87222124 0.77634649
## Proportion of Variance 0.5360463 0.1623703 0.09571374 0.07582845
## Cumulative Proportion  0.5360463 0.6984166 0.79413031 0.86995876
##                            Comp.5     Comp.6     Comp.7     Comp.8
## Standard deviation     0.65981755 0.53457325 0.45751637 0.32119948
## Proportion of Variance 0.05477328 0.03595302 0.02633506 0.01297988
## Cumulative Proportion  0.92473204 0.96068506 0.98702012 1.00000000

The non-scaled PCA seems to be driven almost entirely by GNI, because its absolute variance is large compared to the other variables. Thus, some kind of scaling makes sense, giving all variables the same initial weight.

In the scaled PCA, the x-dimension seems to gather variables related to the individual-level freedom and opportunities of women (explaining 53% of the variance). The y-axis seems to map variables indicating women’s general participation in society (i.e. politics), explaining 16% of the variance. Together, the first two PCs explain nearly 70% of all variance.
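The variance shares quoted above follow directly from the component standard deviations; a self-contained sketch using the sdev values printed in the summary above:

```r
# Component standard deviations from summary(pc) above
sdev <- c(2.0641471, 1.1360379, 0.87222124, 0.77634649,
          0.65981755, 0.53457325, 0.45751637, 0.32119948)
prop <- sdev^2 / sum(sdev^2)   # proportion of variance per component
round(prop[1:2], 3)            # 0.536 and 0.162
round(sum(prop[1:2]), 3)       # 0.698, i.e. nearly 70%
```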

MCA and Tea data

library(FactoMineR)
data(tea)
glimpse(tea)
## Observations: 300
## Variables: 36
## $ breakfast        <fct> breakfast, breakfast, Not.breakfast, Not.brea...
## $ tea.time         <fct> Not.tea time, Not.tea time, tea time, Not.tea...
## $ evening          <fct> Not.evening, Not.evening, evening, Not.evenin...
## $ lunch            <fct> Not.lunch, Not.lunch, Not.lunch, Not.lunch, N...
## $ dinner           <fct> Not.dinner, Not.dinner, dinner, dinner, Not.d...
## $ always           <fct> Not.always, Not.always, Not.always, Not.alway...
## $ home             <fct> home, home, home, home, home, home, home, hom...
## $ work             <fct> Not.work, Not.work, work, Not.work, Not.work,...
## $ tearoom          <fct> Not.tearoom, Not.tearoom, Not.tearoom, Not.te...
## $ friends          <fct> Not.friends, Not.friends, friends, Not.friend...
## $ resto            <fct> Not.resto, Not.resto, resto, Not.resto, Not.r...
## $ pub              <fct> Not.pub, Not.pub, Not.pub, Not.pub, Not.pub, ...
## $ Tea              <fct> black, black, Earl Grey, Earl Grey, Earl Grey...
## $ How              <fct> alone, milk, alone, alone, alone, alone, alon...
## $ sugar            <fct> sugar, No.sugar, No.sugar, sugar, No.sugar, N...
## $ how              <fct> tea bag, tea bag, tea bag, tea bag, tea bag, ...
## $ where            <fct> chain store, chain store, chain store, chain ...
## $ price            <fct> p_unknown, p_variable, p_variable, p_variable...
## $ age              <int> 39, 45, 47, 23, 48, 21, 37, 36, 40, 37, 32, 3...
## $ sex              <fct> M, F, F, M, M, M, M, F, M, M, M, M, M, M, M, ...
## $ SPC              <fct> middle, middle, other worker, student, employ...
## $ Sport            <fct> sportsman, sportsman, sportsman, Not.sportsma...
## $ age_Q            <fct> 35-44, 45-59, 45-59, 15-24, 45-59, 15-24, 35-...
## $ frequency        <fct> 1/day, 1/day, +2/day, 1/day, +2/day, 1/day, 3...
## $ escape.exoticism <fct> Not.escape-exoticism, escape-exoticism, Not.e...
## $ spirituality     <fct> Not.spirituality, Not.spirituality, Not.spiri...
## $ healthy          <fct> healthy, healthy, healthy, healthy, Not.healt...
## $ diuretic         <fct> Not.diuretic, diuretic, diuretic, Not.diureti...
## $ friendliness     <fct> Not.friendliness, Not.friendliness, friendlin...
## $ iron.absorption  <fct> Not.iron absorption, Not.iron absorption, Not...
## $ feminine         <fct> Not.feminine, Not.feminine, Not.feminine, Not...
## $ sophisticated    <fct> Not.sophisticated, Not.sophisticated, Not.sop...
## $ slimming         <fct> No.slimming, No.slimming, No.slimming, No.sli...
## $ exciting         <fct> No.exciting, exciting, No.exciting, No.exciti...
## $ relaxing         <fct> No.relaxing, No.relaxing, relaxing, relaxing,...
## $ effect.on.health <fct> No.effect on health, No.effect on health, No....

All variables except age are multinomial/categorical, and most are unordered. Their relationships are therefore better assessed not with correlations but with, e.g., Cramér’s V:

library(vcd)
library(corrplot) # needed below for corrplot()
tmp<-lapply(1:ncol(tea), function(i) sapply(1:ncol(tea), function(j) {assocstats(table(tea[,i], tea[,j]))$cramer}) )
kor<-as.matrix(as.data.frame(tmp))
rownames(kor)<-colnames(tea)
colnames(kor)<-colnames(tea)
corrplot(kor, order="hclust", method="square")

We can see that age seems to be related to all other variables. Also, where, price, and how form a correlated cluster, as do breakfast and frequency.

Let’s conduct multiple correspondence analysis. Data description says: “The data used here concern a questionnaire on tea. We asked to 300 individuals how they drink tea (18 questions), what are their product’s perception (12 questions) and some personal details (4 questions).”

Furthermore, “The first 18 questions are active ones, the 19th is a supplementary quantitative variable (the age) and the last variables are supplementary categorical variables.”

Thus, let’s set the last variables (from column 19 onward) as supplementary, with age (column 19) as a supplementary quantitative variable.

library(factoextra)

mod1<-MCA(tea,quanti.sup=19,quali.sup=20:36, graph=F)
fviz_screeplot(mod1, addlabels = TRUE, ylim = c(0, 45))

fviz_mca_biplot(mod1, 
               repel = F, # Avoid text overlapping (slow if many points)
               ggtheme = theme_minimal())

fviz_mca_var(mod1, choice = "mca.cor", 
            repel = TRUE, # Avoid text overlapping (slow)
            ggtheme = theme_minimal())

# Contributions of rows to dimension 1
fviz_contrib(mod1, choice = "var", axes = 1, top = 15)

# Contributions of rows to dimension 2
fviz_contrib(mod1, choice = "var", axes = 2, top = 15)

# Total contribution to dimension 1 and 2
fviz_contrib(mod1, choice = "var", axes = 1:2, top = 15)

Dim1: 10 Most correlated variables

head(dimdesc(mod1, axes = c(1,2))[[1]]$quali, 10)
##                 R2      p.value
## where    0.4179301 1.255462e-35
## tearoom  0.3718911 6.082138e-32
## how      0.2988286 1.273180e-23
## friends  0.2431995 8.616289e-20
## resto    0.2264676 2.319804e-18
## tea.time 0.1920380 1.652462e-15
## price    0.2160938 4.050469e-14
## pub      0.1472236 5.846592e-12
## work     0.1115359 3.000872e-09
## How      0.1028519 4.796010e-07

Dim2: 10 Most correlated variables

head(dimdesc(mod1, axes = c(1,2))[[2]]$quali, 10)
##                R2      p.value
## where  0.62550194 4.542155e-64
## price  0.56056797 1.837909e-50
## how    0.51288621 4.103156e-47
## Tea    0.16034278 5.359827e-12
## resto  0.05883014 2.165287e-05
## age_Q  0.07663110 9.613084e-05
## dinner 0.04764166 1.385133e-04
## work   0.04334283 2.825934e-04
## sugar  0.03078909 2.286813e-03
## How    0.04300447 4.565763e-03

Chapter 6: Analysis of longitudinal data

Chapter 8 of MABS using the RATS

Read in the data (from the RDS object that retains the factor type definition):

library(data.table)
library(ggplot2)
library(dplyr)
library(magrittr)
RATSL<-readRDS("data/RATSL.rds")

Glimpse the data

glimpse(RATSL)
## Observations: 176
## Variables: 5
## $ ID     <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, ...
## $ Group  <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1, 1...
## $ COL    <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, ...
## $ Weight <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 445, ...
## $ Time   <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8, 8...
head(RATSL)
##    ID Group COL Weight Time
## 1:  1     1 WD1    240    1
## 2:  2     1 WD1    225    1
## 3:  3     1 WD1    245    1
## 4:  4     1 WD1    260    1
## 5:  5     1 WD1    255    1
## 6:  6     1 WD1    260    1
tail(RATSL)
##    ID Group  COL Weight Time
## 1: 11     2 WD64    472   64
## 2: 12     2 WD64    628   64
## 3: 13     3 WD64    525   64
## 4: 14     3 WD64    559   64
## 5: 15     3 WD64    548   64
## 6: 16     3 WD64    569   64

So we have rats with multiple observations, each belonging to one of three treatment groups. Next, let’s plot the weight of each rat in each group separately as a time series:

Plot groups separately

p1 <- ggplot(RATSL, aes(x = Time, y = Weight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(limits = c(min(RATSL$Weight), max(RATSL$Weight)))
p6

We can already see an increasing trend in all groups, as well as significant baseline differences between the groups.

Standardise the scores and replot

RATSL <- RATSL %>%
   group_by(Time) %>%
   mutate( stdweight = (Weight - mean(Weight))/sd(Weight) ) %>%
   ungroup()
glimpse(RATSL)
## Observations: 176
## Variables: 6
## $ ID        <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ Group     <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3, 1, 1...
## $ COL       <fct> WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD1, WD...
## $ Weight    <int> 240, 225, 245, 260, 255, 260, 275, 245, 410, 405, 44...
## $ Time      <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 8, 8...
## $ stdweight <dbl> -1.0011429, -1.1203857, -0.9613953, -0.8421525, -0.8...
p1 <- ggplot(RATSL, aes(x = Time, y = stdweight, linetype = ID))
p2 <- p1 + geom_line() + scale_linetype_manual(values = rep(1:10, times=4))
p3 <- p2 + facet_grid(. ~ Group, labeller = label_both)
p4 <- p3 + theme_bw() + theme(legend.position = "none")
p5 <- p4 + theme(panel.grid.minor.y = element_blank())
p6 <- p5 + scale_y_continuous(name = "standardized weight")
p6

After standardizing, some differences in trends appear: group 1 seems quite steady, group 2 mostly increases in weight (with one exception), and group 3 generally has a decreasing trend.
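The standardisation above can be sanity-checked in base R on toy data (the values are made up): within every timepoint the z-scored weights should have mean 0 and sd 1.

```r
d <- data.frame(Time = rep(1:3, each = 4),
                Weight = c(1:4, 2:5, 10:13))           # toy weights
d$std <- ave(d$Weight, d$Time,
             FUN = function(w) (w - mean(w)) / sd(w))  # per-timepoint z-score
tapply(d$std, d$Time, mean)  # ~0 for every timepoint
tapply(d$std, d$Time, sd)    # 1 for every timepoint
```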

Number of measurement times, baseline (day 1) included:

n <- RATSL$Time %>% unique() %>% length()
# Make a summary data
# (note: following the course template, the se denominator is the number of
# measurement times; strictly, it should be the group size)
RATSS <- RATSL %>%
   group_by(Group, Time) %>%
   summarise( mean=mean(Weight), se=sd(Weight)/sqrt(n) ) %>%
   ungroup()
glimpse(RATSS)
## Observations: 33
## Variables: 4
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2,...
## $ Time  <int> 1, 8, 15, 22, 29, 36, 43, 44, 50, 57, 64, 1, 8, 15, 22, ...
## $ mean  <dbl> 250.625, 255.000, 254.375, 261.875, 264.625, 265.000, 26...
## $ se    <dbl> 4.589478, 3.947710, 3.460116, 4.100800, 3.333956, 3.5529...
p1 <- ggplot(RATSS, aes(x = Time, y = mean, linetype = Group, shape = Group))
p2 <- p1 + geom_line() + scale_linetype_manual(values = c(1,2,3))
p3 <- p2 + geom_point(size=3) + scale_shape_manual(values = c(1,2,3))
p4 <- p3 + geom_errorbar(aes(ymin=mean-se, ymax=mean+se, linetype="1"), width=0.3)
p5 <- p4 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6 <- p5 + theme(legend.position = c(0.8,0.8))
p7 <- p6 + scale_y_continuous(name = "mean(Weight) +/- se(Weight)")
p7

The groups have clear baseline differences. We can also see that the within-group variation differs: group 1 has very similar weights, while group 2 has the largest within-group differences. Generally, all groups show an increasing trend, group 2 perhaps the steepest.

Boxplot to spot inconsistent distributions

p1 <- ggplot(RATSL, aes(x = factor(Time), y = Weight, fill = Group))
p2 <- p1 + geom_boxplot(position = position_dodge(width = 0.9))
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + theme(legend.position = c(0.8,0.8))
p5 <- p4 + scale_x_discrete(name = "week")
# Black & White version:
#p6 <- p5 + scale_fill_grey(start = 0.5, end = 1)
p5

We can see that group 2 actually has one outlier that pushes the mean higher than is perhaps reasonable. Group 1 also has one outlier that drags the mean downwards. If we compare the trends of the group medians (which are more robust to outliers), they show quite similar increasing slopes.

Group differences and boxplots

# Make a summary data of mean weight per rat over all measurement times
RATSG <- RATSL %>%
   filter(Time > 0) %>%
   group_by(Group, ID) %>%
   summarise( mean=mean(Weight) ) %>%
   ungroup()
glimpse(RATSG)
## Observations: 16
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3
## $ ID    <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16
## $ mean  <dbl> 261.0909, 237.6364, 260.1818, 266.5455, 269.4545, 274.72...
p1 <- ggplot(RATSG, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight)")
p5

The same outliers we identified earlier show up here as well. We also now find one outlier in group 3, which earlier could have been considered a borderline case (appearing in only three weeks).

Remove outliers

RATSG1 <- RATSG %>%
   filter( (Group == 1 & mean > 240) |
             (Group == 2 & mean < 500)  |
             (Group == 3 & mean > 500))
glimpse(RATSG1)
## Observations: 13
## Variables: 3
## $ Group <fct> 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3
## $ ID    <fct> 1, 3, 4, 5, 6, 7, 8, 9, 10, 11, 14, 15, 16
## $ mean  <dbl> 261.0909, 260.1818, 266.5455, 269.4545, 274.7273, 274.63...
p1 <- ggplot(RATSG1, aes(x = Group, y = mean))
p2 <- p1 + geom_boxplot()
p3 <- p2 + theme_bw() + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p4 <- p3 + stat_summary(fun.y = "mean", geom = "point", shape=23, size=4, fill = "white")
p5 <- p4 + scale_y_continuous(name = "mean(Weight)")
p5

Now that the outliers are removed, the within-group variances look much better.

ANOVA without outliers

With the outliers removed, let’s apply one-way analysis of variance (ANOVA is a generalization of the t-test; the t-test works only with two groups).
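The generalization claim can be illustrated: with exactly two groups, the one-way ANOVA F statistic equals the squared (equal-variance) t statistic. A sketch on simulated data:

```r
set.seed(2)
d <- data.frame(g = rep(c("a", "b"), each = 10), y = rnorm(20))
t2 <- unname(t.test(y ~ g, data = d, var.equal = TRUE)$statistic^2)
Fv <- summary(aov(y ~ g, data = d))[[1]][["F value"]][1]
all.equal(t2, Fv)  # TRUE: F = t^2 for two groups
```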

summary(aov(mean ~ Group, data = RATSG1))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Group        2 175958   87979    2577 2.72e-14 ***
## Residuals   10    341      34                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

We see a highly statistically significant difference between the groups (p = 2.7e-14), which was anticipated from the earlier plots.
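The reported p-value can be reproduced directly from the F statistic and its degrees of freedom:

```r
# tail probability of F(2, 10) at the reported statistic
pf(2577, df1 = 2, df2 = 10, lower.tail = FALSE)  # ~2.7e-14
```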

Include baseline

Next, add the baseline from the original data as a new variable to the summary data, and then remove outliers again.

RATS<-fread("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/rats.txt")[,-1,with=F] # RATS has row numbers, remove them
baseline <- RATS$WD1
RATSG2 <- RATSG %>% mutate(baseline) %>%
   filter( (Group == 1 & mean > 240) |
             (Group == 2 & mean < 500)  |
             (Group == 3 & mean > 500))

Fit the ANCOVA model and see the results

fit <- lm(mean ~ baseline + Group, data = RATSG2)
summary(fit)
## 
## Call:
## lm(formula = mean ~ baseline + Group, data = RATSG2)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.0310 -2.6287  0.1002  1.8269  7.1808 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 201.1903    25.7382   7.817 2.66e-05 ***
## baseline      0.2605     0.1010   2.580   0.0297 *  
## Group2      138.8380    17.0411   8.147 1.91e-05 ***
## Group3      199.6530    27.1916   7.342 4.36e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.669 on 9 degrees of freedom
## Multiple R-squared:  0.9989, Adjusted R-squared:  0.9985 
## F-statistic:  2692 on 3 and 9 DF,  p-value: 1.324e-13

Including the baseline as a covariate, all group differences remain statistically significant, but not as overwhelmingly as without the baseline.

Chapter 9 of MABS using the BPRS

Read in the data (from the RDS object that retains the factor type definition):

BPRSL<-readRDS("data/BPRSL.rds")
glimpse(BPRSL)
## Observations: 360
## Variables: 5
## $ treatment <fct> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1...
## $ subject   <fct> 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 1...
## $ COL       <fct> week0, week0, week0, week0, week0, week0, week0, wee...
## $ bprs      <int> 42, 58, 54, 55, 72, 48, 71, 30, 41, 57, 30, 55, 36, ...
## $ Week      <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0...

Plot

p1 <- ggplot(BPRSL, aes(x = Week, y = bprs, group = subject))
p2 <- p1 + geom_text(aes(label = treatment))
p3 <- p2 + scale_x_continuous(name = "Time (week)", breaks = seq(0, 60, 10))
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw()
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

It is hard to draw conclusions, but the general trend seems to decline and then increase slightly at the end. This could indicate that a non-linear model would fit the data best, but let’s first proceed with linear models.

Regression

BPRS_reg <- lm(bprs ~ Week + treatment, data = BPRSL)
summary(BPRS_reg)
## 
## Call:
## lm(formula = bprs ~ Week + treatment, data = BPRSL)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -22.454  -8.965  -3.196   7.002  50.244 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  46.4539     1.3670  33.982   <2e-16 ***
## Week         -2.2704     0.2524  -8.995   <2e-16 ***
## treatment2    0.5722     1.3034   0.439    0.661    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 12.37 on 357 degrees of freedom
## Multiple R-squared:  0.1851, Adjusted R-squared:  0.1806 
## F-statistic: 40.55 on 2 and 357 DF,  p-value: < 2.2e-16

There is a statistically significant trend over time, but treatment does not appear to be associated with the BPRS score.

Plot

p1 <- ggplot(BPRSL[order(Week)], aes(x = Week, y = bprs, group = interaction(subject, treatment)))
p2 <- p1 + geom_line() # + geom_line(aes(linetype = treatment)) # treatment given in middle
p3 <- p2 + scale_x_continuous(name = "Time (week)", breaks = seq(0, 60, 10))
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw() + theme(legend.position = "top")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p6

The line plot confirms the association with time, as do the pairs plots below.

Pairs plot

BPRS<-fread("https://raw.githubusercontent.com/KimmoVehkalahti/MABS/master/Examples/data/BPRS.txt")
pairs(BPRS[, 3:11], cex = 0.7)

LMM, random intercept

Let’s analyze data now with linear mixed model and three different model definitions: random intercept, random slope and intercept, and random slope and random intercept including their interaction term.

library(lme4)
## Loading required package: Matrix
BPRS_ref <- lmer(bprs ~ Week + treatment + (1 | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ Week + treatment + (1 | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2748.7   2768.1  -1369.4   2738.7      355 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0481 -0.6749 -0.1361  0.4813  3.4855 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev.
##  subject  (Intercept)  47.41    6.885  
##  Residual             104.21   10.208  
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     1.9090  24.334
## Week         -2.2704     0.2084 -10.896
## treatment2    0.5722     1.0761   0.532
## 
## Correlation of Fixed Effects:
##            (Intr) Week  
## Week       -0.437       
## treatment2 -0.282  0.000

LMM, random slope and intercept

BPRS_ref1 <- lmer(bprs ~ Week + treatment + (Week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref1)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ Week + treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2745.4   2772.6  -1365.7   2731.4      353 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -2.8919 -0.6194 -0.0691  0.5531  3.7976 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 64.8202  8.0511        
##           Week         0.9608  0.9802   -0.51
##  Residual             97.4307  9.8707        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##             Estimate Std. Error t value
## (Intercept)  46.4539     2.1052  22.066
## Week         -2.2704     0.2977  -7.626
## treatment2    0.5722     1.0405   0.550
## 
## Correlation of Fixed Effects:
##            (Intr) Week  
## Week       -0.582       
## treatment2 -0.247  0.000
anova(BPRS_ref, BPRS_ref1)
## Data: BPRSL
## Models:
## BPRS_ref: bprs ~ Week + treatment + (1 | subject)
## BPRS_ref1: bprs ~ Week + treatment + (Week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref   5 2748.7 2768.1 -1369.4   2738.7                           
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4 7.2721      2    0.02636 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the random slope term seems to improve the model somewhat (p = 0.03).
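The likelihood-ratio p-value in the anova table can be reproduced by hand from the deviances reported above (the two extra random-effect parameters give 2 degrees of freedom):

```r
chisq <- 2738.7 - 2731.4                   # deviance difference between models
pchisq(chisq, df = 2, lower.tail = FALSE)  # ~0.026, matching the anova table
```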

LMM, random slope and intercept, interaction term

BPRS_ref2 <- lmer(bprs ~ Week * treatment + (Week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref2)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ Week * treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2744.3   2775.4  -1364.1   2728.3      352 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0512 -0.6271 -0.0767  0.5288  3.9260 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.0016  8.0624        
##           Week         0.9688  0.9843   -0.51
##  Residual             96.4699  9.8219        
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                 Estimate Std. Error t value
## (Intercept)      47.8856     2.2522  21.262
## Week             -2.6283     0.3589  -7.323
## treatment2       -2.2911     1.9090  -1.200
## Week:treatment2   0.7158     0.4010   1.785
## 
## Correlation of Fixed Effects:
##             (Intr) Week   trtmn2
## Week        -0.650              
## treatment2  -0.424  0.469       
## Wek:trtmnt2  0.356 -0.559 -0.840
anova(BPRS_ref1, BPRS_ref2)
## Data: BPRSL
## Models:
## BPRS_ref1: bprs ~ Week + treatment + (Week | subject)
## BPRS_ref2: bprs ~ Week * treatment + (Week | subject)
##           Df    AIC    BIC  logLik deviance  Chisq Chi Df Pr(>Chisq)  
## BPRS_ref1  7 2745.4 2772.6 -1365.7   2731.4                           
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3 3.1712      1    0.07495 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding the interaction does not seem that important (p = 0.07), but it probably does not hurt either.

Plot fitted values from models

Fitted <- fitted(BPRS_ref2)
BPRSL1 <- BPRSL %>% mutate(Fitted)
p1 <- ggplot(BPRSL1, aes(x = Week, y = bprs, group = interaction(subject, treatment)))
p2 <- p1 + geom_line() #+ geom_line(aes(linetype = Group))
p3 <- p2 + scale_x_continuous(name = "Time (week)")
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw() + theme(legend.position = "right") # "none" in the book
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7

p1 <- ggplot(BPRSL1, aes(x = Week, y = Fitted, group = interaction(subject, treatment)))
p2 <- p1 + geom_line() #+ geom_line(aes(linetype = Group))
p3 <- p2 + scale_x_continuous(name = "Time (week)")
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7

graph1; graph2

The fitted values estimate a generally declining linear trend for all individuals. However, as we noted in the first plot, the relationship is probably non-linear: we see here that it increases at the end after first declining. Thus, depending on the application, we might want to include second- and third-order polynomial terms to account for this non-linearity:
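Note that poly() generates orthogonal polynomial contrasts rather than raw powers, which keeps the higher-order terms uncorrelated with each other; a quick check over the BPRS weeks:

```r
P <- poly(0:8, degree = 3)  # orthogonal polynomial basis over weeks 0..8
round(crossprod(P), 10)     # identity matrix: the columns are orthonormal
```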

BPRS_ref3 <- lmer(bprs ~ poly(Week,3)*treatment + (Week | subject), data = BPRSL, REML = FALSE)
summary(BPRS_ref3)
## Linear mixed model fit by maximum likelihood  ['lmerMod']
## Formula: bprs ~ poly(Week, 3) * treatment + (Week | subject)
##    Data: BPRSL
## 
##      AIC      BIC   logLik deviance df.resid 
##   2736.7   2783.4  -1356.4   2712.7      348 
## 
## Scaled residuals: 
##     Min      1Q  Median      3Q     Max 
## -3.0468 -0.5958 -0.0857  0.4910  3.9220 
## 
## Random effects:
##  Groups   Name        Variance Std.Dev. Corr 
##  subject  (Intercept) 65.865   8.116         
##           Week         1.007   1.003    -0.52
##  Residual             91.897   9.586         
## Number of obs: 360, groups:  subject, 20
## 
## Fixed effects:
##                            Estimate Std. Error t value
## (Intercept)                 37.3722     1.7074  21.888
## poly(Week, 3)1            -128.7615    17.4536  -7.377
## poly(Week, 3)2              12.4209    13.5571   0.916
## poly(Week, 3)3              25.1360    13.5571   1.854
## treatment2                   0.5722     1.0105   0.566
## poly(Week, 3)1:treatment2   35.0685    19.1726   1.829
## poly(Week, 3)2:treatment2   33.8452    19.1726   1.765
## poly(Week, 3)3:treatment2  -25.1862    19.1726  -1.314
## 
## Correlation of Fixed Effects:
##             (Intr) pl(W,3)1 pl(W,3)2 pl(W,3)3 trtmn2 p(W,3)1: p(W,3)2:
## poly(Wk,3)1 -0.017                                                    
## poly(Wk,3)2  0.000  0.000                                             
## poly(Wk,3)3  0.000  0.000    0.000                                    
## treatment2  -0.296  0.000    0.000    0.000                           
## ply(W,3)1:2  0.000 -0.549    0.000    0.000    0.000                  
## ply(W,3)2:2  0.000  0.000   -0.707    0.000    0.000  0.000           
## ply(W,3)3:2  0.000  0.000    0.000   -0.707    0.000  0.000    0.000
anova(BPRS_ref2, BPRS_ref3)
## Data: BPRSL
## Models:
## BPRS_ref2: bprs ~ Week * treatment + (Week | subject)
## BPRS_ref3: bprs ~ poly(Week, 3) * treatment + (Week | subject)
##           Df    AIC    BIC  logLik deviance Chisq Chi Df Pr(>Chisq)   
## BPRS_ref2  8 2744.3 2775.4 -1364.1   2728.3                           
## BPRS_ref3 12 2736.7 2783.4 -1356.4   2712.7 15.54      4   0.003703 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

On paper it seems to perform much better. Let’s also plot the fitted values from the model with the second- and third-order polynomial terms:

Fitted <- fitted(BPRS_ref3)
BPRSL2 <- BPRSL %>% mutate(Fitted)
p1 <- ggplot(BPRSL2, aes(x = Week, y = bprs, group = interaction(subject, treatment)))
p2 <- p1 + geom_line() #+ geom_line(aes(linetype = Group))
p3 <- p2 + scale_x_continuous(name = "Time (week)")
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw() + theme(legend.position = "right") # "none" in the book
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Observed")
graph1 <- p7

p1 <- ggplot(BPRSL2, aes(x = Week, y = Fitted, group = interaction(subject, treatment)))
p2 <- p1 + geom_line() #+ geom_line(aes(linetype = Group))
p3 <- p2 + scale_x_continuous(name = "Time (week)")
p4 <- p3 + scale_y_continuous(name = "BPRS")
p5 <- p4 + theme_bw() + theme(legend.position = "right")
p6 <- p5 + theme(panel.grid.major = element_blank(), panel.grid.minor = element_blank())
p7 <- p6 + ggtitle("Fitted")
graph2 <- p7

graph1; graph2

=> Profit.